Segway: simultaneous segmentation of multiple functional genomics data sets with heterogeneous patterns of missing data

Authors

  • Michael M. Hoffman
  • Orion J. Buske
  • Jeff A. Bilmes
  • William Stafford Noble
Abstract

New functional genomics methods enabled by high-throughput DNA sequencing have begun to produce an unprecedented amount of data anchored to the genome of humans and other species. We have developed a method to identify joint patterns in the results of multiple classes of functional genomics experiments. The method partitions the genome into variable-length segments using a dynamic Bayesian network where the dynamic (or “time”) axis represents genomic position. Segments are assigned one of a finite number of labels such that the vectors of observations are similar in segments with the same label. A multinet switching structure allows inference on sequences with combinations of missing data in different tracks that vary at each position, without downsampling or interpolation. This permits us to take full advantage of the high-resolution data generated by sequencing assays, working at up to 1-base-pair resolution. Our system can also incorporate other kinds of data into its classification, including lower-resolution continuous data such as microarray data, or discrete data such as the dinucleotide sequence beginning at each position. We demonstrate the use of the method in both unsupervised and semisupervised training of segment parameters.
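The idea of labeling positions without interpolating over gaps can be illustrated with a much simpler model than the one the abstract describes. The sketch below is not the authors' dynamic Bayesian network or its multinet switching structure. It assumes a plain hidden Markov model with independent Gaussian tracks and fixed, hand-chosen parameters, and it handles missing data by leaving absent tracks out of the likelihood at each position. All function names, parameters, and data values are hypothetical.

```python
import numpy as np

def log_emission(obs, means, variances):
    """Log-likelihood of each position under each label, ignoring missing tracks.

    obs: (T, D) array of T positions and D tracks; NaN marks a missing value.
    means, variances: (K, D) arrays for K labels (assumed diagonal Gaussians).
    """
    present = ~np.isnan(obs)                        # (T, D) mask of observed values
    filled = np.where(present, obs, 0.0)            # placeholder; masked out below
    diff = filled[:, None, :] - means[None, :, :]   # (T, K, D)
    logp = -0.5 * (np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None])
    # A missing track contributes 0 to the log-likelihood, i.e. it is marginalized out.
    return np.where(present[:, None, :], logp, 0.0).sum(axis=2)   # (T, K)

def viterbi_segment(obs, means, variances, log_trans, log_init):
    """Most likely label for every position (standard Viterbi decoding)."""
    emit = log_emission(obs, means, variances)
    T, K = emit.shape
    score = log_init + emit[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans           # cand[i, j]: label i -> label j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[t]
    labels = np.empty(T, dtype=int)
    labels[-1] = score.argmax()
    for t in range(T - 1, 0, -1):                   # trace back the best path
        labels[t - 1] = back[t, labels[t]]
    return labels

# Hypothetical toy data: 8 positions, 2 tracks, a different missing pattern per position.
nan = np.nan
obs = np.array([[0.1, nan], [nan, nan], [0.0, 0.1], [nan, -0.2],
                [5.1, nan], [nan, 4.9], [5.0, 5.2], [nan, 5.0]])
means = np.array([[0.0, 0.0], [5.0, 5.0]])          # label 0 is low, label 1 is high
variances = np.ones((2, 2))
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))  # favors staying in a label
log_init = np.log(np.array([0.5, 0.5]))

labels = viterbi_segment(obs, means, variances, log_trans, log_init)
```

In this toy example the second position has no observed track at all, so its label is determined only by the transition model and its neighbors. Runs of equal values in `labels` correspond to variable-length segments.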





Publication date: 2009